Introduction

This step of the BDC workflow extracts the collection year, whenever possible, from complete and legitimate date information. It also flags collecting years that are dubious (e.g., 07/07/10), illegitimate (e.g., 1300, 2100), or not supplied (e.g., 0 or NA).


Important:

The results of the VALIDATION tests used to flag data quality are appended to the database in separate fields and returned as TRUE or FALSE, where TRUE indicates correct records and FALSE indicates potentially problematic or suspect records.

Installation

You can install the released version of ‘BDC’ from GitHub with:

if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results.
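
As a minimal sketch in base R, the folders referenced later in this article (Output/Intermediate, Output/Report, and Output/Figures) could be created as shown below. The helper `create_output_dirs()` is hypothetical and not part of bdc; the package may provide its own way of setting up these folders.

```r
# Hypothetical helper (not a bdc function): create the output folders
# that the later steps of this article read from and write to.
create_output_dirs <- function(root = ".") {
  dirs <- file.path(root, "Output", c("Intermediate", "Report", "Figures"))
  for (d in dirs) {
    # recursive = TRUE also creates "Output" itself when it is missing
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

create_output_dirs()
```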

Read the database

Read the database created in the [**Space**](https://brunobrr.github.io/bdc/articles/03_space.html) step of the BDC workflow. It is also possible to read any dataset containing the **required** fields to run the workflow (more details [here](https://brunobrr.github.io/bdc/articles/integrate_datasets.html)).

database <-
  qs::qread("Output/Intermediate/03_space_database.qs")

Standardization of character encoding.

for (i in seq_len(ncol(database))) {
  # `[[` extracts the column as a vector, so this also works for tibbles
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}


1 - Records lacking event date information

VALIDATION. This function flags records lacking event date information (e.g., empty or NA).

check_time <-
  bdc_eventDate_empty(data = database, eventDate = "verbatimEventDate")
#> 
#> bdc_eventDate_empty:
#> Flagged 3179 records.
#> One column was added to the database.
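
To illustrate what this kind of flag looks like, here is a small base R sketch on toy data. It is not the implementation of `bdc_eventDate_empty()`; the helper `is_filled()` is hypothetical, and treating whitespace-only strings as empty is an assumption made for this example.

```r
# Toy data; the column name mirrors the one used above.
toy <- data.frame(
  verbatimEventDate = c("2010-07-07", "", NA, "  ", "1998"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE = date information present,
# FALSE = NA, empty, or whitespace-only (flagged).
is_filled <- function(x) {
  !is.na(x) & nzchar(trimws(x))
}

toy$.eventDate_empty <- is_filled(toy$verbatimEventDate)
toy
```

As elsewhere in the workflow, FALSE marks the suspect records, so the three records without usable date text are the ones flagged.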

2 - Extract year from event date

ENRICHMENT. This function extracts a four-digit year from unambiguously interpretable collecting dates.

check_time <-
  bdc_year_from_eventDate(data = check_time, eventDate = "verbatimEventDate")
#> 
#> bdc_year_from_eventDate:
#> Four-digit year were extracted from 2933 records.
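
The idea of taking a year only when it is unambiguous can be sketched in base R as follows. This is an illustration under assumptions, not the code of `bdc_year_from_eventDate()`: the helper `extract_year()` is hypothetical, and it simply picks the first standalone run of exactly four digits.

```r
# Hypothetical helper: return the first standalone four-digit run as an
# integer year, or NA when the string has none (e.g., "07/07/10").
extract_year <- function(x) {
  # Lookarounds ensure the four digits are not part of a longer number
  pos <- regexpr("(?<![0-9])[0-9]{4}(?![0-9])", x, perl = TRUE)
  found <- !is.na(pos) & pos > 0
  year <- rep(NA_integer_, length(x))
  year[found] <- as.integer(substring(x[found], pos[found], pos[found] + 3L))
  year
}

dates <- c("2010-07-07", "07/07/10", "15 Mar 1987", NA, "")
years <- extract_year(dates)
years
```

In this sketch the dubious date "07/07/10" yields NA because no component can be read as a four-digit year.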

3 - Records with out-of-range collecting year

VALIDATION. This function identifies records with an illegitimate or potentially imprecise collecting year. The year can be out of range (e.g., in the future) or earlier than a threshold year supplied by the user (e.g., 1900). Older records are more likely to be imprecise because their coordinates were often derived from locality descriptions during geo-referencing.

check_time <-
  bdc_year_outOfRange(data = check_time,
                      eventDate = "year",
                      year_threshold = 1900)
#> 
#> bdc_year_outOfRange:
#> Flagged 12 records.
#> One column was added to the database.
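
The range check can be pictured with the base R sketch below. The helper `year_in_range()` is hypothetical, and two details are assumptions rather than documented bdc behavior: the threshold year itself is treated as valid, and missing years are left unflagged because the empty-date test already covers them.

```r
# Sketch only: boundary and NA handling are assumptions, not bdc's rules.
year_in_range <- function(year, year_threshold = 1900,
                          current_year = as.integer(format(Sys.Date(), "%Y"))) {
  ok <- year >= year_threshold & year <= current_year
  ok[is.na(year)] <- TRUE  # assumption: missing years are not flagged here
  ok
}

years <- c(1300, 1899, 1900, 1987, 2100, NA)
# current_year is fixed so that the example does not depend on today's date
in_range <- year_in_range(years, year_threshold = 1900, current_year = 2024)
in_range
```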

Report

Create a column named “.summary” that summarizes the results of all VALIDATION tests. This column is FALSE if a record was flagged as FALSE by any test (i.e., it is a potentially invalid or suspect record).

check_time <- bdc_summary_col(data = check_time)
#> Column '.summary' already exist. It will be updated
#> 
#> bdc_summary_col:
#> Flagged 3481 records.
#> One column was added to the database.
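
The logic of the summary column can be sketched in base R on toy flags: a record passes only if every flag column is TRUE. This is an illustration, not the code of `bdc_summary_col()`, and it assumes that all flag columns start with a dot and contain no missing values.

```r
# Toy flag columns named after the tests above (TRUE = passed).
flags <- data.frame(
  .eventDate_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE)
)

# Assumption: every column whose name starts with "." is a test flag.
flag_cols <- grep("^\\.", names(flags), value = TRUE)

# Element-wise AND across all flag columns
flags$.summary <- Reduce(`&`, flags[flag_cols])
flags
```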



Creating a report summarizing the results of all tests.

report <-
  bdc_create_report(data = check_time,
                    database_id = "database_id",
                    workflow_step = c("prefilter", "taxonomy", "space", "time"))
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the taxonomy in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the space in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the time in:
#> Output/Report

report


Figures

Create a histogram showing the number of records collected over the years.

bdc_create_figures(data = check_time,
                   database_id = "database_id",
                   workflow_step = "time")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures


Figure: Number of records sampled over the years.

Figure: Summary of all tests of the time step; note that some databases lack event date information.

Figure: Summary of all validation tests of the BDC workflow.


Save a “raw” database

Save the original database containing the results of all data quality tests appended in separate columns.

check_time %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "04_time_raw_database.qs"))

Filter the database

Let’s remove potentially erroneous or suspect records flagged by the data quality tests applied in all steps of the BDC workflow to get a “clean”, “fitness-for-use” database.

output <-
  check_time %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#> 
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .uncer_terms, .val, .equ, .zer, .cap, .cen, .urb, .otl, .gbf, .inst, .dpl, .rou, .eventDate_empty, .year_outOfRange, .summary
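
For readers who want to see the effect of this step in isolation, the base R sketch below keeps the records that passed all tests and then drops the flag columns. It runs on toy data and is not the implementation of `bdc_filter_out_flags()`; identifying flag columns by a leading dot is an assumption based on the column names listed above.

```r
# Toy database with two data columns and three flag columns.
toy <- data.frame(
  scientificName = c("Sp. a", "Sp. b", "Sp. c"),
  year = c(1987L, NA, 2100L),
  .eventDate_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE),
  .summary = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only records that passed every test
kept <- toy[toy$.summary, , drop = FALSE]

# Drop the flag columns (assumed to be those starting with ".")
clean <- kept[, !startsWith(names(kept), "."), drop = FALSE]
clean
```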

Save a “fitness-for-use” database

output %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "04_time_clean_database.qs"))